Detection of Outliers in TWSTFT Data Used in TAI

Authors

  • A. Harmegnies
  • G. Panfilo
  • E. F. Arias
Abstract

This paper describes a new filtering technique used to detect and eliminate outlying data in two-way satellite time and frequency transfer (TWSTFT) time links. In the case of TWSTFT data used to calculate International Atomic Time (TAI), three main problems have to be considered: the difficulty of distinguishing outliers from useful data; the need to avoid deleting useful data; and the fact that TWSTFT links can show an underlying slope, which makes the standard treatment more difficult. Using phase and frequency filtering techniques, a new way of detecting outliers, while avoiding the removal of useful data, has been developed and implemented at the BIPM to clean TWSTFT data.

INTRODUCTION

Each month, the BIPM Time, Frequency, and Gravimetry Section produces International Atomic Time (TAI) and Coordinated Universal Time (UTC) from data contributed by almost 70 laboratories. This involves about 370 atomic clocks linked by various techniques. Most time links (85%) are computed from multi-channel GPS receivers, either single- or dual-frequency; 14% of the links are from TWSTFT [1,2] observations in Europe, North America, and the Asia-Pacific region. In September 2009, a new technique based on the carrier phase combined with the code of the GPS signal (Precise Point Positioning, GPS PPP [3]) was introduced into BIPM Circular T [4].

The BIPM treats, in total, TWSTFT observations from 20 laboratories, half of which are currently under study prior to their inclusion in the routine TAI calculation. Several TWSTFT links are affected by outlying data; to ensure safe handling of the data, a new cleaning technique has been developed at the BIPM. Although the data-cleaning technique has been developed for application to TWSTFT links, it could equally be adapted to other kinds of time links.
Removing the outliers on TWSTFT links is a challenge for a number of reasons: it is difficult to distinguish outliers from useful data; the TWSTFT links may show an underlying slope which complicates the standard treatment [5-10]; and finally, the number of TWSTFT data points is rather low. TWSTFT time links routinely provide 12 data points per day (one measurement every 2 hours in most cases, and every hour in the case of Asia-Europe time links); this number is rather low compared to the number of measurements for GPS time links (about 100 measurements per day).

41st Annual Precise Time and Time Interval (PTTI) Meeting

The first part of this paper illustrates the various approaches tested using phase and frequency data. The second part shows the results obtained using the new filtering technique.

THE OUTLIER DETECTION TECHNIQUE

The techniques most commonly used to detect outliers in data used in the calculation of TAI [5] are based on different statistical estimators applied to the phase data. However, TWSTFT links may show an underlying slope that makes the standard treatment more difficult and increases the risk of removing useful data. In this case, a better approach is to consider both phase and frequency values. After testing various approaches, we concluded that mixing different methods increases the reliability of the filter.

A first step, called "rough cleaning", consists of detecting very large outliers by using a moving average on the phase data and observing the residuals between the real and the filtered values. The second step consists of a more refined detection, making use of two different mathematical methods:

1. the moving average applied to phase data;
2. the Median Absolute Deviation (MAD) estimator applied to the frequency data.

A data point is removed only when both techniques identify it as an outlier.
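The decision rule of the second step can be illustrated with a minimal sketch. It assumes that each of the two methods has already produced one boolean flag per phase point; the function name `combine_flags` and the example flags are hypothetical and are not taken from the BIPM software.

```python
import numpy as np


def combine_flags(phase_flags, freq_flags):
    """Return True only where BOTH detectors flag the same phase point.

    phase_flags: flags from the moving average applied to phase data.
    freq_flags:  flags from the MAD estimator applied to frequency data,
                 already mapped back onto phase points (an assumption here).
    """
    phase_flags = np.asarray(phase_flags, dtype=bool)
    freq_flags = np.asarray(freq_flags, dtype=bool)
    if phase_flags.shape != freq_flags.shape:
        raise ValueError("both flag arrays must describe the same phase points")
    return phase_flags & freq_flags


# Hypothetical flags for four phase points.
phase_flags = [False, True, True, False]
freq_flags = [False, True, False, False]
removed = combine_flags(phase_flags, freq_flags)
```

In this example only the second point is removed; the third point, flagged by one method only, is kept, which reflects the aim of not deleting useful data.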
This new outlier detection process gives satisfactory results when applied to TWSTFT links without removing too many data and complements the existing cleaning tools developed for time link data [5]. In the next sections, the mathematical techniques used in the algorithm for outlier detection are presented.

FREQUENCY FILTERING

The mean and standard deviation, standard estimators used to characterize the properties of a data set, can be strongly affected by outliers [7]. An estimator more resistant to outliers is the median. In this study, we consider a robust estimator, based on the median, called the Median Absolute Deviation (MAD) [6-8,10], which is frequently used in data sets affected by outliers. The MAD is defined as the median of the absolute deviations from the data median:

$$\mathrm{MAD} = \mathrm{median}_i\left(\left|X_i - \mathrm{median}_j(X_j)\right|\right) \qquad (1)$$

The classical "standard deviation" can be estimated using the MAD [6]:

$$\hat{\sigma}_{\mathrm{MAD}} = K \cdot \mathrm{MAD} \qquad (2)$$

where $K$ is a constant scale factor which depends on the type of the distribution: for a normal distribution of data, $K = 1.4826$ [6,7]. Consequently, by choosing $K = 1.4826$ the expected value of $\hat{\sigma}_{\mathrm{MAD}}$ is equal to the standard deviation for normally distributed data.

To test the robustness of the MAD with respect to the standard deviation in the case of TAI calculation, a test was performed using the TWSTFT data reported to the BIPM in February 2009 (hereafter, 0902). We calculated the classical standard deviation and the estimated standard deviation using the MAD of the frequency data for all TWSTFT links. The MAD was found to be more robust than the classical standard deviation. Figure 1 shows the histogram of the standard deviation of the links (in blue) and the standard deviation calculated by using the MAD (in red).

[Figure 1. Comparison between the standard deviation and the standard deviation estimated using the MAD for TWSTFT data of February 2009. Vertical axis: frequency, from 0 to 6E-13. Horizontal axis: TWSTFT link with PTB (AOS, CH, IT, NICT, NIST, NPL, OP, ROA, SP, USNO, USNX, VSL). Series: standard deviation; MAD estimated standard deviation.]

In month 0902, outliers are mainly present in the USNO-PTB X-band link (USNX), in the NICT-PTB link (NICT), and in the NPL-PTB link (NPL). The behavior of the statistical tools is different in each case, but in each case the MAD was found to be more resistant to outliers.

According to Sesia and Tavella [8], the MAD is the most widely used filter to detect and remove outliers. To create the filter, a threshold $t$ has to be defined so that a value $X_i$ is considered an outlier if:

$$\left|X_i - \mathrm{median}_j(X_j)\right| > t \cdot \hat{\sigma}_{\mathrm{MAD}} \qquad (3)$$

In the case of our filter, the threshold was defined as $t = 3$, which corresponds to about 1% of outliers if the data are normally distributed [7]. This filter is applied to frequency data derived from phase data using the well-known relations:

$$y(t) = \frac{dx(t)}{dt}, \qquad y_i = \frac{x_{i+1} - x_i}{\tau_i} \qquad (4)$$

where $y(t)$ is the frequency value and $x(t)$ is the phase value at time $t$. TWSTFT data are usually sampled every hour (or every 2 hours), so $\tau$ = 1 hour (or 2 hours) is the standard time interval. A time interval $\tau_i$ larger than $\tau$ between two successive observations indicates a hole in the data. In such cases, $y(t)$ will be affected and the value will be wrong. For this reason, holes in phase data are detected before computing the MAD, and only frequency data not corresponding to a hole are used to calculate the MAD (1). Once the MAD is determined, the complete data set is treated to define outliers with respect to the calculated MAD. The frequency values can be obtained by considering different values of $\tau_i$.

The use of frequency data leads to several difficulties linked to the identification of the corresponding phase data that generate frequency outliers. Several examples are shown in the four plots in Figure 2.
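Relations (1) to (4) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the BIPM implementation: the function names, the hole tolerance `hole_tol`, and the example data are hypothetical, and the median in relation (3) is taken over the frequency values that do not correspond to a hole.

```python
import numpy as np

K_NORMAL = 1.4826  # scale factor K of relation (2) for normally distributed data


def mad_sigma(values):
    """Standard deviation estimated from the MAD, relations (1) and (2)."""
    values = np.asarray(values, dtype=float)
    mad = np.median(np.abs(values - np.median(values)))
    return K_NORMAL * mad


def frequency_outliers(t, x, tau, t_threshold=3.0, hole_tol=1.5):
    """Flag frequency values that violate relation (3).

    t, x:     observation times and phase values (same length).
    tau:      standard time interval, in the same unit as t.
    hole_tol: hypothetical tolerance; an interval longer than
              hole_tol * tau is treated as a hole in the data.
    """
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    tau_i = np.diff(t)
    y = np.diff(x) / tau_i              # relation (4)
    regular = tau_i <= hole_tol * tau   # frequency values not spanning a hole
    center = np.median(y[regular])
    sigma = mad_sigma(y[regular])       # MAD computed without the holes
    flags = np.abs(y - center) > t_threshold * sigma  # applied to the full set
    return y, flags


# Hypothetical link: one point every 2 hours, slope of 0.5 ns/h,
# small deterministic noise, and one phase outlier of +10 ns at index 5.
t = np.arange(0.0, 24.0, 2.0)
noise = np.array([0.0, 0.1, -0.1, 0.05, 0.0, 0.0,
                  -0.05, 0.1, 0.0, -0.1, 0.05, 0.0])
x = 0.5 * t + noise
x[5] += 10.0
y, flags = frequency_outliers(t, x, tau=2.0)
```

With these data the single phase outlier produces two flagged frequency values (indices 4 and 5), one for the jump away from the trend and one for the return. The underlying slope does not disturb the detection because it only shifts the median of the frequency values.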
[Figure 2. Representation of specific behaviors in phase and corresponding frequency data: (a) a single phase outlier; (b) successive phase outliers; (c) time step in phase data; and (d) a more complex case with successive series of outliers. Each panel plots phase (ns) and frequency against the observation number.]

Sesia and Tavella [8] consider that when frequency outliers are detected, the two corresponding phase data should be eliminated. In such cases, more phase data than necessary are removed, because two phase data are used to produce one frequency value. This solution was not applicable to our case because of the small quantity of data in TWSTFT time links. The frequency data have to be carefully checked to identify the corresponding phase outlier.

We analyzed a number of data after the outlier detection and checked the sign of the frequency jump. If a time step occurs (Figure 2 (c)), the frequency outlier corresponding to the end of the observation period does not exist and, thus, no data are modified. If frequency outliers corresponding to successive phase outliers are found, all points are considered as outliers (Figure 2 (b)). This check is essential to distinguish between successive phase outliers and a time step. If we removed the two phase data corresponding to each frequency outlier, we might remove useful data or miss real outliers.

Figure 3 shows as an example the case of the USNO-PTB X-band link. The filter is efficient in the presence of a single outlier, but if several close outliers appear, then the filter fails, such as for the data prior to MJD 54869.5. By checking the frequency values with the sign, the filter works very well.
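One possible reading of the sign check is sketched below. It is an interpretation, not the published algorithm: a flagged frequency value is paired with the next flagged value of opposite sign found within a look-ahead window, and the phase points between the two jumps are flagged. The function name and the look-ahead length `max_run` are hypothetical.

```python
import numpy as np


def phase_outliers_from_frequency(y, freq_flags, max_run=4):
    """Map flagged frequency values onto phase points using the jump sign.

    y[i] is assumed to be derived from the phase points x[i] and x[i+1].
    max_run is a hypothetical limit on how many frequency values are
    inspected after a flagged jump when looking for the return jump.
    """
    y = np.asarray(y, dtype=float)
    freq_flags = np.asarray(freq_flags, dtype=bool)
    phase_flags = np.zeros(len(y) + 1, dtype=bool)
    center = np.median(y)
    i = 0
    while i < len(y):
        if not freq_flags[i]:
            i += 1
            continue
        jump_sign = np.sign(y[i] - center)
        match = None
        # Look ahead for a flagged jump of opposite sign (the return jump).
        for j in range(i + 1, min(i + 1 + max_run, len(y))):
            if freq_flags[j] and np.sign(y[j] - center) == -jump_sign:
                match = j
                break
        if match is None:
            # No return jump: treated as a time step, no phase data flagged.
            i += 1
        else:
            # Phase points between the outgoing and return jumps are outliers.
            phase_flags[i + 1:match + 1] = True
            i = match + 1
    return phase_flags


base = np.full(10, 0.5)

# (a) single phase outlier at index 5: jump up at y[4], jump back at y[5].
y_single = base.copy()
y_single[4], y_single[5] = 5.5, -4.5
single = phase_outliers_from_frequency(y_single, np.abs(y_single - 0.5) > 1)

# (b) two successive phase outliers at indices 5 and 6.
y_pair = base.copy()
y_pair[4], y_pair[6] = 5.5, -4.5
pair = phase_outliers_from_frequency(y_pair, np.abs(y_pair - 0.5) > 1)

# (c) time step at index 5: only one jump, no return.
y_step = base.copy()
y_step[4] = 5.5
step = phase_outliers_from_frequency(y_step, np.abs(y_step - 0.5) > 1)
```

In case (a) only phase point 5 is flagged, in case (b) points 5 and 6 are flagged, and in case (c) nothing is flagged. Removing both phase points behind every flagged frequency value would instead delete the useful points 4 and 6 in case (a).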
[Figure 3. Original and cleaned data for the USNO-PTB X-band link (0902). The blue line (♦) represents the real phase data, the black line links the frequency data (▲), and the red dots (×) show the cleaned data. Horizontal axis: MJD, from 54869 to 54872. Vertical axes: phase (ns) and frequency.]

Figure 2(d) shows a more complicated situation where the interpretation of the behavior of the frequency values becomes difficult. Without examining the phase values, there is no way of detecting whether the last datum of a short time jump has a useful ("normal") value. Since the goal is to avoid the deletion of useful data, a second filter on the phase data is added to confirm the outlier detection. This second filter is described in the next section.

PHASE FILTERING

To ensure the removal of only real phase data outliers, a second test is performed on the phase data. We filter the data with the moving average technique, and detect the outliers by analyzing the residuals between the filtered and real data. The width of the window used in the phase filtering can be adapted depending on the noise present in the data set. We decided to use 12 data in the moving average, but a smaller size can be required if the drift of the phase is high, and the window can be enlarged if large data holes exist. A maximum accepted value of the residuals between real and smoothed data must be fixed.
To determine this value, the cumulated frequencies of the residuals were calculated for different time links in different periods. The results are reported in Figure 4. It is seen, for example, that the NICT-PTB time link for two different months shows very different results, and it is therefore difficult to set a default value perfectly adapted to every situation. We chose to set the residual threshold Z equal to 2 ns because most residuals are located between -2 ns and +2 ns independently of the link used. However, this parameter can be adapted to special cases to increase the efficiency of the filter.

[Figure 4. Cumulated frequency of residuals between the filtered and real data for different time links. Horizontal axis: residual value in ns, from -6 to 6. Vertical axis: cumulated percentage of residuals, from 0 to 100. Series: NICT 0806, NICT 0901, OP 0806, OP 0901.]

This filter can be used for outlier detection and the results in most cases will be satisfactory. However, in the calculation of time links for TAI, we are often faced with very complex situations where the detection of useful data is not straightforward. One example is reported in Figure 5, showing the results obtained by applying the phase filter to the NICT-PTB time link (data period 0806). In this situation, we should avoid using these data in the TAI calculation and consider an alternative technique. We have retained this case as a good example to check and refine the outlier detection technique.

[Plot title: NICT-PTB TWSTFT time link (0806)]
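The phase filter can be sketched as follows, using the window of 12 data and the threshold Z = 2 ns quoted above. The placement of the window around each point and its truncation at the ends of the series are assumptions made for this sketch, and the function names and example data are hypothetical.

```python
import numpy as np


def moving_average(x, window=12):
    """Moving average over `window` points placed around each datum.

    Assumption: the window covers indices i - window//2 up to
    i - window//2 + window - 1 and is truncated at both ends of the series.
    """
    x = np.asarray(x, dtype=float)
    half = window // 2
    smoothed = np.empty_like(x)
    for i in range(len(x)):
        lo = max(0, i - half)
        hi = min(len(x), i - half + window)
        smoothed[i] = x[lo:hi].mean()
    return smoothed


def phase_outliers(x, window=12, z_ns=2.0):
    """Flag phase data whose residual to the smoothed series exceeds Z."""
    residuals = np.asarray(x, dtype=float) - moving_average(x, window)
    return np.abs(residuals) > z_ns


# Hypothetical phase series in ns: slope of 0.1 ns per observation
# with one outlier of +10 ns at index 10.
x = 0.1 * np.arange(24)
x[10] += 10.0
flags = phase_outliers(x)
```

Only index 10 is flagged here. The outlier also pulls the moving average of its neighbours by 10/12 ns, which remains below Z in this example; a steeper drift or a larger outlier would call for a smaller window or an adapted threshold, as noted in the text.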




Publication date: 2010